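We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker from a mixture. Typical approaches exploit properties of the audio signal, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we use a concept specifier, such as an image or speech, to extract the speech of speakers talking about a concept, i.e., a topic of interest. Solving this novel problem would open the door to innovative applications, such as focusing on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, which makes it challenging to represent a target concept directly. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by deep metric learning on paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of our scheme, we performed experiments using a set of images associated with spoken captions. That is, we generated speech mixtures from these spoken captions and used the images or speech signals as concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compare ConceptBeam with two methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.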
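The amount of audio data available on public websites is growing rapidly, and an efficient mechanism for accessing the desired data is needed. We propose a content-based audio retrieval method that can retrieve target audio that is similar to, but slightly different from, the query audio by introducing auxiliary textual information that describes the difference between the query and the target audio. While the range of conventional content-based audio retrieval is limited to audio similar to the query, the proposed method can adjust the retrieval range by adding an embedding of the auxiliary text query modifier to the embedding of the query sample audio in a shared latent space. To evaluate our method, we built a dataset comprising pairs of different audio clips together with text describing the difference. The experimental results show that the proposed method retrieves the paired audio more accurately than the baseline. We also confirmed, based on visualization, that the proposed method obtains a shared latent space in which the audio difference and the corresponding text are represented as similar embedding vectors.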
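The retrieval step described in this abstract, adding a text-modifier embedding to the query audio embedding and searching a shared latent space, can be sketched roughly as follows. This is only an illustration under stated assumptions: the embeddings are hand-made toy vectors rather than encoder outputs, cosine similarity is an assumed scoring choice, and the function names `normalize`, `cosine`, and `retrieve` are hypothetical, not taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_audio_emb, modifier_text_emb, candidate_embs):
    """Rank candidates by similarity to (audio embedding + text embedding).

    Assumption: all embeddings already live in one shared latent space.
    Returns candidate indices, best match first.
    """
    query = [a + t for a, t in zip(query_audio_emb, modifier_text_emb)]
    scores = [cosine(query, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (hypothetical): the second candidate equals query + modifier.
query_audio = [1.0, 0.0, 0.0]
modifier_text = [0.0, 1.0, 0.0]
candidates = [
    [1.0, 0.0, 0.0],  # identical to the query audio
    [1.0, 1.0, 0.0],  # query audio shifted by the described difference
    [0.0, 0.0, 1.0],  # unrelated clip
]
ranking = retrieve(query_audio, modifier_text, candidates)
```

With a zero modifier the ranking falls back to plain similarity search around the query audio, which is the conventional setting the abstract contrasts against.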
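We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge Task 2: "Unsupervised anomalous sound detection (ASD) for machine condition monitoring applying domain generalization techniques". Domain shifts are a critical problem for the application of ASD systems. Because domain shifts can change the acoustic characteristics of data, a model trained in a source domain performs poorly in a target domain. In DCASE 2021 Challenge Task 2, we organized an ASD task for handling domain shifts. In that task, the occurrence of domain shifts was assumed to be known. In practice, however, the domain of each sample may not be given, and domain shifts can occur implicitly. In 2022 Task 2, we focus on domain generalization techniques that detect anomalies regardless of domain shifts. Specifically, the domain of each sample is not given in the test data, and only one threshold is allowed for all domains. We will add the challenge results and an analysis of the submissions after the challenge submission deadline.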
Classification bandits are multi-armed bandit problems whose task is to classify a given set of arms into either the positive or the negative class, depending on whether the rate of arms with an expected reward of at least h is not less than w, for given thresholds h and w. We study a special classification bandit problem in which arms correspond to points x in d-dimensional real space with expected rewards f(x) generated according to a Gaussian process prior. We develop a framework algorithm for the problem using various arm selection policies and propose policies called FCB and FTSV. We show a smaller sample complexity upper bound for FCB than for the existing level set estimation algorithm, in which whether f(x) is at least h must be decided for every arm's x. Arm selection policies that depend on an estimated rate of arms with rewards of at least h are also proposed and shown to improve empirical sample complexity. According to our experimental results, the rate-estimation versions of FCB and FTSV, together with that of the popular active learning policy that selects the point with the maximum variance, outperform other policies for synthetic functions, and the rate-estimation version of FTSV is also the best performer on our real-world dataset.
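The decision rule that defines the positive and negative classes can be written out as a small sketch. It assumes the true expected rewards are known, which a bandit algorithm does not have (it must estimate them from samples), so it illustrates only the target label and not FCB or FTSV. The function name `classify_arm_set` is hypothetical.

```python
def classify_arm_set(expected_rewards, h, w):
    """Return the class of a set of arms under thresholds h and w.

    The set is "positive" when the rate of arms whose expected reward is
    at least h is not less than w, and "negative" otherwise.
    """
    if not expected_rewards:
        raise ValueError("at least one arm is required")
    good = sum(1 for r in expected_rewards if r >= h)
    rate = good / len(expected_rewards)
    return "positive" if rate >= w else "negative"


# Toy example (hypothetical rewards): two of four arms reach h = 0.5.
rewards = [0.9, 0.2, 0.7, 0.4]
label_loose = classify_arm_set(rewards, h=0.5, w=0.5)   # rate 0.5 >= 0.5
label_strict = classify_arm_set(rewards, h=0.5, w=0.6)  # rate 0.5 <  0.6
```

Because only the rate matters, the label can sometimes be settled without deciding every individual arm, which is the contrast with level set estimation drawn in the abstract.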
The long-standing theory that a colour-naming system evolves under the dual pressure of efficient communication and perceptual mechanism is supported by a growing number of linguistic studies, including an analysis of four decades of diachronic data from the Nafaanra language. This inspires us to explore whether artificial intelligence could evolve and discover a similar colour-naming system by optimising the communication efficiency represented by high-level recognition performance. Here, we propose a novel colour quantisation transformer, CQFormer, that quantises colour space while maintaining the accuracy of machine recognition on the quantised images. Given an RGB image, the Annotation Branch maps it into an index map before generating the quantised image with a colour palette, while the Palette Branch uses a key-point detection approach to find proper colours for the palette within the whole colour space. By interacting with colour annotation, CQFormer is able to balance machine vision accuracy against colour perceptual structure, such as a distinct and stable colour distribution for the discovered colour system. Interestingly, we even observe a consistent evolution pattern between our artificial colour system and basic colour terms across human languages. In addition, our method offers efficient quantisation that effectively compresses image storage while maintaining high performance in high-level recognition tasks such as classification and detection. Extensive experiments demonstrate the superior performance of our method with extremely low bit-rate colours. We will release the source code soon.
Owing to the widespread adoption of the Internet of Things, a vast amount of sensor information is being acquired in real time. Accordingly, the cost of communicating data from edge devices is increasing. Compressed sensing (CS), a data compression method that can be used on edge devices, has been attracting attention as a way to reduce communication costs. In CS, estimating the appropriate compression ratio is important. One existing method adaptively estimates the compression ratio for the acquired data using reinforcement learning. However, existing reinforcement learning methods that can be used on edge devices are computationally expensive. In this study, we developed an efficient reinforcement learning method for edge devices, referred to as the actor-critic online sequential extreme learning machine (AC-OSELM), and a system that compresses data by estimating an appropriate compression ratio on the edge using AC-OSELM. The performance of the proposed method in estimating the compression ratio is evaluated by comparing it with other reinforcement learning methods for edge devices. The experimental results show that AC-OSELM achieved the same or better compression performance and faster compression ratio estimation than the existing methods.
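We propose a lightweight and accurate method for detecting anomalies in videos. Existing methods use multiple-instance learning (MIL) to determine the normal/abnormal status of each segment of a video. Recent successful studies argue that, to achieve high accuracy, it is important to learn the temporal relationships among segments rather than focusing on a single segment. We therefore analyzed existing methods that have been successful in recent years and found that, while it is indeed important to learn all segments together, the temporal order among them is irrelevant to achieving high accuracy. Based on this finding, we do not use the MIL framework; instead, we propose a lightweight model with a self-attention mechanism that automatically extracts, from all input segments, the features that are important for determining normal/abnormal status. As a result, our neural network model has 1.3% of the number of parameters of the existing method. We evaluated the frame-level detection accuracy of our method on three benchmark datasets (UCF-Crime, ShanghaiTech, and XD-Violence) and demonstrate that it achieves accuracy comparable to or better than state-of-the-art methods.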
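Understanding a user's intention and recognizing semantic entities in their sentences, known as natural language understanding (NLU), is the upstream task of many natural language processing tasks. One of the main challenges is collecting a sufficient amount of annotated data to train a model. Existing research on text augmentation does not adequately consider entities and therefore performs poorly on NLU tasks. To solve this problem, we propose a novel NLP data augmentation technique, Entity Aware Data Augmentation (EADA), which applies a tree structure, the Entity Aware Syntax Tree (EAST), to represent sentences combined with attention on the entities. Our EADA technique automatically constructs an EAST from a small amount of annotated data and then generates a large number of training instances for intent detection and slot filling. Experimental results on four datasets show that the proposed technique significantly outperforms existing data augmentation methods in terms of both accuracy and generalization ability.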
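The expectation-maximization (EM) algorithm is a simple meta-algorithm that has been used for many years as a method of statistical inference when measurements are missing from the observed data or when the data consist of observables and unobservables. Its general properties are well studied, and there are countless ways of applying it to individual problems. In this paper, we introduce the $em$ algorithm, an information geometric formulation of the EM algorithm, together with its extensions and applications to various problems. Specifically, we will see that, from the geometric perspective, it is possible to formulate an outlier-robust inference algorithm, an algorithm for calculating channel capacity, parameter estimation methods on the probability simplex, particular multivariate analysis methods such as principal component analysis in a space of probability models and modal regression, matrix factorization, and the learning of generative models, which have recently attracted attention in deep learning.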
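Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or to pair the source and target videos, which is expensive in memory and cannot scale to long videos. To address these limitations, we propose a memory-efficient graph-based video DA approach. First, our method models each source or target video as a graph: nodes represent video frames, and edges represent the temporal or visual similarity relationships between frames. We use a graph attention network to learn the weights of individual frames and simultaneously align the source and target videos in a domain-invariant graph feature space. Instead of storing a huge number of sub-videos, our method constructs only one graph, with a graph attention mechanism, per video, which reduces the memory cost substantially. Extensive experiments show that, compared with state-of-the-art methods, we achieve superior performance while significantly reducing the memory cost.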
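The graph construction described in this abstract (frames as nodes, temporal or visual-similarity relations as edges) might look roughly like the sketch below. Several details are assumptions made for illustration and do not come from the paper: temporal edges link consecutive frames only, visual edges use a cosine-similarity threshold, and the name `build_video_graph` and its `sim_threshold` parameter are hypothetical. The graph attention network itself is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_video_graph(frame_features, sim_threshold=0.9):
    """Build one undirected graph for one video.

    Nodes are frame indices. Edges (i, j) with i < j are added when the
    frames are consecutive in time (assumed temporal relation) or when
    their features are visually similar (assumed cosine threshold).
    """
    n = len(frame_features)
    edges = set()
    # Temporal edges between neighbouring frames.
    for i in range(n - 1):
        edges.add((i, i + 1))
    # Visual-similarity edges between non-adjacent frames.
    for i in range(n):
        for j in range(i + 2, n):
            if cosine(frame_features[i], frame_features[j]) >= sim_threshold:
                edges.add((i, j))
    return edges


# Toy features (hypothetical): frames 0 and 2 look almost alike.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
edges = build_video_graph(features)
```

The point of the sketch is the storage pattern: one node per frame and one edge set per video, in place of stored sub-videos for every temporal combination of frames.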